\[ \definecolor{mathBlack}{RGB}{0,0,0} \definecolor{mathOrange}{RGB}{253, 126, 20} \definecolor{mathLightGreen}{RGB}{32, 201, 151} \definecolor{mathGreen}{RGB}{24, 188, 156} \definecolor{mathYellow}{RGB}{253, 156, 18} \definecolor{mathBlue}{RGB}{52, 152, 219} \definecolor{mathRed}{RGB}{231, 76, 60} \definecolor{mathPurple}{RGB}{111, 66, 193} \]
Descriptive Statistics
Agenda
- Frequency Distributions
- Central Tendency
- Variability
Readings
- Being a Statistician Means Never Having to Say You’re Certain
Frequency Distributions
- FREQUENCY DISTRIBUTION
- RELATIVE FREQUENCY DISTRIBUTION
- PROPORTION
- PERCENTAGE
- CUMULATIVE
- RATE
- BAR GRAPH
- HISTOGRAM
- LINE GRAPH
- STATISTICAL MAP
Objectives
- Calculate proportions and percentages
- Construct and analyze frequency, percentage, and cumulative distributions
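# Packages used by the code on these slides
library(dplyr)     # %>%, select(), filter(), count(), summarise()
library(tidyr)     # pivot_wider()
library(haven)     # as_factor(), zap_missing()
library(flextable) # flextable(), color(), bg()
# Assumes gss_all (the GSS data) and the style_flextable() helper are created in the setup code
gss_all$premarsx <- as_factor(zap_missing(gss_all$premarsx))
gss_all$sex <- as_factor(zap_missing(gss_all$sex))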
freq_premarsx <- gss_all %>%
select(id, year, sex, premarsx) %>%
filter(year == 2024, !is.na(premarsx)) %>%
count(premarsx)
total_row <- freq_premarsx %>%
summarise(across(where(is.numeric), sum)) %>%
mutate(premarsx = "Total")
# combine
table_premarsx <- rbind(freq_premarsx, total_row)
Table 1. Attitudes about sex before marriage
# Render the table
table_premarsx %>%
flextable() %>%
style_flextable()

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |
Survey question: There’s been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and woman have sex relations before marriage, do you think it is _________.
Table 1. Attitudes about sex before marriage
table_premarsx %>%
flextable() %>%
style_flextable() %>%
color(color = "#E74C3C", i = 5, j = "n")

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |
The number of respondents who answered this survey question.
Table 1. Attitudes about sex before marriage
table_premarsx %>%
flextable() %>%
style_flextable() %>%
color(color = "#E74C3C", i = 3, j = "n")

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |
The number of respondents who said pre-marital sex was “wrong only sometimes.”
Source: U.S. General Social Survey 2024
Are women more likely than men to say premarital sex is “not wrong at all”?
Table 2. Attitudes about sex before marriage by gender
gss_all$premarsx <- as_factor(zap_missing(gss_all$premarsx))
gss_all$sex <- as_factor(zap_missing(gss_all$sex))
freq_premarsx <- gss_all %>%
select(id, year, sex, premarsx) %>%
filter(!is.na(premarsx), !is.na(sex)) %>%
group_by(sex) %>%
count(premarsx) %>%
pivot_wider(
names_from = sex,
values_from = n,
values_fill = 0
)
# create total row
total_row <- freq_premarsx %>%
summarise(across(where(is.numeric), sum)) %>%
mutate(premarsx = "Total") %>%
select(names(freq_premarsx)) # ensure column order matches
# combine
table_premarsx <- rbind(freq_premarsx, total_row)
# Get number of rows
n_rows <- nrow(table_premarsx) # fixing transparency
# Render the table
table_premarsx %>%
flextable() %>%
style_flextable() %>%
# Manually add zebra stripes with solid colors w/o transparency
bg(i = seq(1, n_rows, by = 2), bg = "white", part = "body") %>%
bg(i = seq(2, n_rows, by = 2), bg = "#F2F2F2", part = "body")

| premarsx | male | female |
|---|---|---|
| always wrong | 4,159 | 7,116 |
| almost always wrong | 1,499 | 2,388 |
| wrong only sometimes | 3,904 | 4,792 |
| not wrong at all | 10,672 | 11,086 |
| Total | 20,234 | 25,382 |

Source: U.S. General Social Survey 1972-2024
Proportions are between 0 and 1.0.
Proportion = count (f) / total number of cases (N).
Percentages are between 0 and 100.
Percentage = proportion × 100.
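As a minimal sketch in base R, the two formulas above can be applied to the 2024 counts from Table 1. The object names (`counts`, `N`, `proportion`, `percentage`) are illustrative and do not appear in the original code.

```r
# Counts (f) for each response, taken from Table 1 (GSS 2024)
counts <- c(
  "always wrong"         = 357,
  "almost always wrong"  = 122,
  "wrong only sometimes" = 258,
  "not wrong at all"     = 1378
)

N <- sum(counts)                # total number of cases
proportion <- counts / N        # each value lies between 0 and 1
percentage <- proportion * 100  # each value lies between 0 and 100

round(proportion, 3)
round(percentage, 1)
```

The proportions sum to 1 and the unrounded percentages sum to 100, which is a quick check that no category was left out.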
gss_all$premarsx <- droplevels(gss_all$premarsx)
# Create frequency & proportions table
tab <- gss_all %>%
filter(!is.na(premarsx), !is.na(sex)) %>%
group_by(sex, premarsx) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(sex) %>%
mutate(percent = round(100 * n / sum(n), 0)) %>%
ungroup() %>%
pivot_wider(
names_from = sex, values_from = c(n, percent)
)
# Add totals row
tab_totals <- tab %>%
summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE))) %>%
mutate(premarsx = "Total")
# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)
Table 2. Attitudes about sex before marriage by gender
## Pretty table
tab_with_totals %>%
select(premarsx, n_male, percent_male, n_female, percent_female) %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n_male = "n", percent_male = "%",
n_female = "n", percent_female = "%"
) %>%
add_header_row(
values = c("", "Men", "Women"),
colwidths = c(1, 2, 2)
) %>%
align(j = c(2, 3, 4, 5), align = "center", part = "all") %>%
color(color = "#18bc9c", i = 4, j = 4) %>%
color(color = "#fd7e14", i = 5, j = 4) %>%
color(color = "#e74c3c", i = 4, j = 5)

| premarsx | Men (n) | Men (%) | Women (n) | Women (%) |
|---|---|---|---|---|
| always wrong | 4,159 | 21 | 7,116 | 28 |
| almost always wrong | 1,499 | 7 | 2,388 | 9 |
| wrong only sometimes | 3,904 | 19 | 4,792 | 19 |
| not wrong at all | 10,672 | 53 | 11,086 | 44 |
| Total | 20,234 | 100 | 25,382 | 100 |
\(\frac{\color{mathGreen}{11{,}086}}{\color{mathOrange}{25{,}382}} \approx 0.4368\) and \(0.4368 \times 100 \approx \color{mathRed}{43.7\%}\)
TIP: A % column should sum to 100! (Rounded percentages can add up to 99 or 101.)
Table 2. Attitudes about sex before marriage by gender
## Pretty table
tab_with_totals %>%
select(premarsx, n_male, percent_male, n_female, percent_female) %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n_male = "n", percent_male = "%",
n_female = "n", percent_female = "%"
) %>%
add_header_row(
values = c("", "Men", "Women"),
colwidths = c(1, 2, 2)
) %>%
align(j = c(2, 3, 4, 5), align = "center", part = "all") %>%
color(color = "#E74C3C", i = 4, j = 3) %>%
color(color = "#E74C3C", i = 4, j = 5)

| premarsx | Men (n) | Men (%) | Women (n) | Women (%) |
|---|---|---|---|---|
| always wrong | 4,159 | 21 | 7,116 | 28 |
| almost always wrong | 1,499 | 7 | 2,388 | 9 |
| wrong only sometimes | 3,904 | 19 | 4,792 | 19 |
| not wrong at all | 10,672 | 53 | 11,086 | 44 |
| Total | 20,234 | 100 | 25,382 | 100 |
A greater proportion of men (53%) than women (44%) say premarital sex is “not wrong at all.”
Source: U.S. General Social Survey 1972-2024
Table 3. Attitudes about sex before marriage, with cumulative percentages
gss_all$premarsx <- droplevels(gss_all$premarsx)
# Create frequency & proportions table
tab <- gss_all %>%
filter(year == 2024, !is.na(premarsx)) %>%
group_by(premarsx) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(
percent = round(100 * n / sum(n), 0),
cum_percent = round(cumsum(percent), 0)
) %>%
ungroup()
# Add totals row
tab_totals <- tab %>%
summarise(across(c(n, percent), \(x) sum(x, na.rm = TRUE))) %>%
mutate(premarsx = "Total")
# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)
## Pretty table
tab_with_totals %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n = "n", percent = "%",
cum_percent = "cumulative %"
) %>%
color(color = "#18bc9c", i = 1, j = 3) %>%
color(color = "#fd7e14", i = 2, j = 3) %>%
color(color = "#e74c3c", i = 2, j = 4)

| premarsx | n | % | cumulative % |
|---|---|---|---|
| always wrong | 357 | 17 | 17 |
| almost always wrong | 122 | 6 | 23 |
| wrong only sometimes | 258 | 12 | 35 |
| not wrong at all | 1,378 | 65 | 100 |
| Total | 2,115 | 100 | |
\({\color{mathGreen} 17} + {\color{mathOrange} 6} = {\color{mathRed} 23\%}\)
Source: U.S. General Social Survey 2024
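A small base R sketch of how the cumulative column in Table 3 is built: each cumulative percentage is the running total of the percentages up to and including that row. The object names are illustrative, not from the original code.

```r
# Counts from Table 3, in the order the categories appear
response <- c("always wrong", "almost always wrong",
              "wrong only sometimes", "not wrong at all")
counts <- c(357, 122, 258, 1378)

percent <- round(100 * counts / sum(counts), 0)  # rounded % for each row
cum_percent <- cumsum(percent)                   # running total of the % column

data.frame(response, n = counts, percent, cum_percent)
```

Because this adds up percentages that are already rounded, the final value can land on 99 or 101 for some variables. Cumulating the counts first, with `round(100 * cumsum(counts) / sum(counts), 0)`, avoids that drift.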
Examples:
- Canada’s divorce rate decreased from 12.7 per 1,000 in 1991 to 5.6 per 1,000 in 2020.
- The 2021 suicide rate of 14.8 per 100,000 population for middle aged Canadians (30-59 years old) was the highest of any age group.
- Canada’s total fertility rate reached a new low in 2023 of 1.26 children per woman.
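A rate expresses a count of events relative to the size of the population in which they could occur, scaled to a convenient base such as 1,000 or 100,000. The sketch below assumes a hypothetical helper, `rate_per()`, and uses made-up counts. It does not reproduce the Canadian figures above.

```r
# Hypothetical helper: events per `base` people in the population
rate_per <- function(events, population, base = 1000) {
  events / population * base
}

# Made-up example: 30 events among 4,000 people, expressed per 1,000
rate_small <- rate_per(30, 4000)

# Made-up example: 12 events among 80,000 people, expressed per 100,000
rate_large <- rate_per(12, 80000, base = 100000)

c(per_1000 = rate_small, per_100000 = rate_large)
```

Choosing a larger base for rare events keeps the rate readable as a whole number rather than a small decimal.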
Nominal variables:
can have frequency distributions, but cannot have cumulative frequency distributions
Ordinal variables:
can have frequency distributions and cumulative frequency distributions
Interval-ratio variables:
can have frequency distributions, cumulative frequency distributions, and rates
A bar graph is used:
for nominal or ordinal variables,
to show frequencies or percentages,
using separated rectangles, with height proportional
to the frequency or percentage.
A histogram is used:
for interval-ratio variables,
to show frequencies or percentages,
using contiguous (touching) rectangles, with height proportional
to the frequency or percentage.
A line graph is used:
for interval-ratio variables,
to show frequencies or percentages,
using a line that joins the point marking each category's frequency, percentage, or average.
A statistical map is used:
for interval-ratio variables,
to show geographical variations, often in ratios,
using variation in color or hue.
Central Tendency
- MEAN
- MEDIAN
- MODE
- OUTLIER
- PERCENTILE
- BIMODAL
- SYMMETRICAL DISTRIBUTION
- POSITIVELY SKEWED DISTRIBUTION
- NEGATIVELY SKEWED DISTRIBUTION
Objectives
- Explain the importance of measures of central tendency.
- Calculate and interpret the mean, the median, and the mode.
- Identify the relative strengths and weaknesses of the three measures.
- Determine and explain the shape of a distribution.
Summary Statistics
We use summary statistics to find out what is TYPICAL in a distribution.
- The mean is the most commonly used measure of central tendency.
- Its weakness is that it is sensitive to outliers (extreme scores in a distribution).
Finding the mean in a list: \(7, 4, 2, 8, 0, 9, 5\)
- Add all observations together: \(7 + 4 + 2 + 8 + 0 + 9 + 5 = 35\)
- Divide the sum by the number of observations: \(\frac{35}{7} = 5\)
| Family ID | Annual Income (CAD) |
|---|---|
| F01 | $48,000 |
| F02 | $52,000 |
| F03 | $45,000 |
| F04 | $50,000 |
| F05 | $53,000 |
| F06 | $49,000 |
| F07 | $46,000 |
| F08 | $51,000 |
| F09 | $175,000 |
| F10 | $250,000 |
Most families in this sample earn between $45K–53K, but two high-income households push the average far above what’s typical.
Source: Totally fake data
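To see how much the two high-income households move the mean, here is a short base R sketch using the ten incomes from the table. The object names are illustrative.

```r
# Annual incomes (CAD) from the table above
income <- c(F01 = 48000, F02 = 52000, F03 = 45000, F04 = 50000, F05 = 53000,
            F06 = 49000, F07 = 46000, F08 = 51000, F09 = 175000, F10 = 250000)

mean_all <- mean(income)           # all ten families
mean_typical <- mean(income[1:8])  # leaving out F09 and F10

c(all_families = mean_all, without_outliers = mean_typical)
```

With all ten families the mean is $81,900, well above what any of the first eight families earns. Without F09 and F10 it is $49,250.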
The median is the value at the 50th percentile in a cumulative frequency distribution.
Finding the median in a list with an odd number of observations:
\(7, 2, 1, 3, 4, 1, 5, 9, 2\)
- Put the list in order: \(1, 1, 2, 2, 3, 4, 5, 7, 9\)
- Pick the center number: \(3\)
Finding the median in a list with an even number of observations:
\(2, 0, 1, 2, 5, 1, 3, 1\)
- Put the list in order: \(0, 1, 1, 1, 2, 2, 3, 5\)
- Add the two center numbers & divide by 2: \(\frac{1 + 2}{2} = 1.5\)
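Both worked examples can be checked in base R. `sort()` puts the list in order, and `median()` picks the center value or averages the two center values. The vector names are illustrative.

```r
odd_list  <- c(7, 2, 1, 3, 4, 1, 5, 9, 2)  # nine observations
even_list <- c(2, 0, 1, 2, 5, 1, 3, 1)     # eight observations

sort(odd_list)   # ordered list; the 5th value is the center
sort(even_list)  # ordered list; the 4th and 5th values are the centers

median_odd  <- median(odd_list)   # 3
median_even <- median(even_list)  # 1.5
```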
How often do the demands of your job interfere with your family life?
gss_all$wkvsfam <- as_factor(zap_missing(gss_all$wkvsfam))
# Create frequency & proportions table
tab <- gss_all %>%
filter(year == 2022, !is.na(wkvsfam)) %>%
group_by(wkvsfam) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(
percent = round(100 * n / sum(n), 0),
cum_percent = round(cumsum(percent), 0)
) %>%
ungroup()
# Add totals row
tab_totals <- tab %>%
summarise(across(c(n, percent), \(x) sum(x, na.rm = TRUE))) %>%
mutate(wkvsfam = "Total")
# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)
## Pretty table
tab_with_totals %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n = "n", percent = "%",
cum_percent = "cumulative %"
) %>%
color(color = "#e74c3c", i = 3, j = 4)

| wkvsfam | n | % | cumulative % |
|---|---|---|---|
| often | 218 | 11 | 11 |
| sometimes | 645 | 33 | 44 |
| rarely | 669 | 34 | 78 |
| never | 447 | 23 | 101 |
| Total | 1,979 | 101 | |
Source: U.S. General Social Survey 2022
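For an ordinal variable like this one, the median is the first category whose cumulative percentage reaches 50. A minimal base R sketch using the counts from the table follows. The object names are illustrative, and the cumulative percentages here are left unrounded.

```r
# Counts in the order of the response categories
counts <- c(often = 218, sometimes = 645, rarely = 669, never = 447)

cum_percent <- 100 * cumsum(counts) / sum(counts)  # unrounded cumulative %

# The median category is the first one at or past the 50th percentile
median_category <- names(counts)[which(cum_percent >= 50)[1]]
median_category
```

This only makes sense when the categories are listed in their natural order, which is why the same approach cannot be used for nominal variables.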
Finding the mode in a list:
\(7, 2, 1, 3, 4, 1, 5, 1, 2\)
- Put the list in order: \(1, 1, 1, 2, 2, 3, 4, 5, 7\)
- Pick the most frequent number: \(1\)
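Base R has no built-in function for the statistical mode (`mode()` reports how an object is stored, not its most frequent value). One common workaround, sketched here with illustrative names, is to count each value with `table()` and keep the value or values with the highest count.

```r
x <- c(7, 2, 1, 3, 4, 1, 5, 1, 2)

freq <- table(x)  # how many times each value occurs

# Keep every value that reaches the highest count (more than one if tied)
modes <- as.numeric(names(freq)[freq == max(freq)])
modes
```

Returning all tied values, rather than only the first, keeps the result honest when a list has more than one mode.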
Table 01. Most of the time people…
gss_all$helpful <- as_factor(zap_missing(gss_all$helpful))
# Create frequency & proportions table
tab <- gss_all %>%
filter(year == 2024, !is.na(helpful)) %>%
group_by(helpful) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(
percent = round(100 * n / sum(n), 0),
cum_percent = round(cumsum(percent), 0)
) %>%
ungroup()
# Add totals row
tab_totals <- tab %>%
summarise(across(c(n, percent), \(x) sum(x, na.rm = TRUE))) %>%
mutate(helpful = "Total")
# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)
## Pretty table
tab_with_totals %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n = "n", percent = "%",
cum_percent = "cumulative %"
) %>%
color(color = "#e74c3c", i = 2, j = 2) %>%
color(color = "#e74c3c", i = 2, j = 3)

| helpful | n | % | cumulative % |
|---|---|---|---|
| try to be helpful | 365 | 39 | 39 |
| looking out for themselves | 440 | 47 | 86 |
| depends | 137 | 15 | 101 |
| Total | 942 | 101 | |
Source: U.S. General Social Survey 2024
TIP: The mode is the MOST common response. More respondents said others are “looking out for themselves” (47%) than gave any other answer.
TIP: A bimodal distribution has two distinct humps, even if the peaks aren’t exactly the same height.
Choosing a Measure
The MODE is appropriate for nominal and ordinal variables.
It can be identified for interval-ratio level variables, but is often not useful.
The MEDIAN is appropriate for interval-ratio and ordinal variables.
It cannot be used for nominal level variables.
The MEAN can ONLY be determined for interval-ratio variables.
Distribution Shapes
Positively Skewed Distribution
Negatively Skewed Distribution
Variability
- RANGE
- INTERQUARTILE RANGE
- VARIANCE
- STANDARD DEVIATION
Objectives
- Explain the importance of measuring variability.
- Calculate and interpret the range, interquartile range, variance, and standard deviation.
- Identify the relative strengths and weaknesses of the measures.
Measures of Variability
Measures of variability describe the diversity in a distribution of an interval-ratio variable.
They reveal how spread out the values in your dataset are.
Range
- The strength of the range is that it is easy to calculate and simple to understand.
- The weakness of the range is that it is based only on the lowest and the highest scores, which could be atypical and therefore it may be misleading.
Finding the range in a list: \(26, 23, 28, 27, 24, 25, 32, 25, 28, 25, 25, 26, 27, 26, 27, 25\)
- Put the list in order: \({\color{mathBlue} 23}, 24, 25, 25, 25, 25, 25, 26, 26, 26, 27, 27, 27, 28, 28, {\color{mathRed} 32}\)
- Subtract the min from the max: \({\color{mathRed} 32} - {\color{mathBlue} 23} = 9\)
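The same calculation in base R, as a quick sketch with an illustrative vector name. Note that R's `range()` returns the lowest and highest values rather than their difference.

```r
scores <- c(26, 23, 28, 27, 24, 25, 32, 25, 28, 25, 25, 26, 27, 26, 27, 25)

range(scores)                             # lowest and highest values
score_range <- max(scores) - min(scores)  # the range as a single number
score_range
```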
Interquartile Range
IQR in a list with an odd number of observations:
\(2, 3, 3, 4, 4, 6, 7, 7, 7, 8, 9, 11, 12\)
- Q1 is the median of the numbers below the median: \({\color{mathOrange} 2, 3, 3, 4, 4, 6,} {\color{mathBlue} 7}, 7, 7, 8, 9, 11, 12\), so \(Q1 = \frac{3 + 4}{2} = {\color{mathOrange} 3.5}\)
- Q3 is the median of the numbers above the median: \(2, 3, 3, 4, 4, 6, {\color{mathBlue} 7}, {\color{mathRed}7, 7, 8, 9, 11, 12}\), so \(Q3 = \frac{8 + 9}{2} = {\color{mathRed} 8.5}\)
- Subtract Q1 from Q3: \({\color{mathRed} 8.5} - {\color{mathOrange} 3.5} = 5\)
IQR in a list with an even number of observations:
\(3, 4, 5, 7, 9, 10, 11, 13\)
- Q1 is the median of the numbers below the median: \({\color{mathOrange} 3, 4, 5, 7,} 9, 10, 11, 13\), so \(Q1 = \frac{4 + 5}{2} = {\color{mathOrange} 4.5}\)
- Q3 is the median of the numbers above the median: \(3, 4, 5, 7, {\color{mathRed}9, 10, 11, 13}\), so \(Q3 = \frac{10 + 11}{2} = {\color{mathRed} 10.5}\)
- Subtract Q1 from Q3: \({\color{mathRed} 10.5} - {\color{mathOrange} 4.5} = 6\)
TIP: The median of the list is \(8\).
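The "median of each half" method used above can be sketched in base R. The helper name `iqr_halves()` is hypothetical, and the sketch assumes a list of at least two values. When the list has an odd length, the middle value is left out of both halves, as in the first example.

```r
# Hypothetical helper: IQR by taking the median of the lower and upper halves
iqr_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2                # size of each half; the middle value is dropped when n is odd
  lower <- x[seq_len(half)]      # values below the median
  upper <- x[(n - half + 1):n]   # values above the median
  q1 <- median(lower)
  q3 <- median(upper)
  c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
}

odd_result  <- iqr_halves(c(2, 3, 3, 4, 4, 6, 7, 7, 7, 8, 9, 11, 12))
even_result <- iqr_halves(c(3, 4, 5, 7, 9, 10, 11, 13))
odd_result
even_result
```

R's built-in `IQR()` estimates the quartiles by interpolation by default, so it can give a different answer from this hand method on small lists.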
Standard Deviation
Along the way to calculating the standard deviation, you calculate the variance of a distribution.
Standard Deviation Formulas
Note that there are two formulas for calculating the standard deviation (and the variance). The population formula sums the squared “deviations from the mean” and divides that sum by the total number of observations. Technically, when working with a sample, statisticians divide by the total number of observations minus one instead.
Once the sample sizes become large enough, there’s a negligible difference between the two. In this course, we’re not going to worry about the difference. Always use the population variance and standard deviation equations.
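Written out, the two versions look like this, with \(X\) for each value, \(\bar{X}\) for the mean, and \(N\) for the number of observations. The labels \(\sigma\) (population) and \(s\) (sample) are the conventional symbols and are an assumption here, since the slides do not name them.

\[ \sigma^2 = \frac{\sum (X - \bar{X})^2}{N} \qquad \sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} \]

\[ s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} \qquad s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} \]

The only difference is the denominator, which is why the two results converge as \(N\) grows.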
Standard Deviation in 5 Steps
- Calculate the mean.
- Subtract the mean from every value (deviation from the mean).
- Square each “deviation from the mean.”
- Calculate the mean of the squared “deviations from the mean.”
- Take the square root of this new mean!
TIP: The mean of the squared “deviations from the mean” is the variance!
\(2,3,4,7,9\)
1. Calculate the mean \(\bar{X}\)
\(\frac{2 + 3 + 4 + 7 +9}{5} = 5\)
\(2,3,4,7,9\)
2. Subtract the mean (\(\bar{X}\)) from every value (\(X\))
\(-3, -2, -1, 2, 4\)
3. Square each difference
\(9, 4, 1, 4, 16\)
4. Calculate the mean of the squares
\(\frac{9 + 4 +1 + 4 +16}{5} = 6.8\)
TIP: 6.8 is known as the variance!
5. Take the square root of the variance
\(\sqrt{6.8} \approx 2.6\)
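The five steps translate directly into base R. This is a sketch with illustrative object names, using the population formula as this course does.

```r
x <- c(2, 3, 4, 7, 9)

x_bar      <- mean(x)         # step 1: the mean
deviations <- x - x_bar       # step 2: deviation of each value from the mean
squared    <- deviations^2    # step 3: squared deviations
variance   <- mean(squared)   # step 4: mean of the squared deviations (the variance)
std_dev    <- sqrt(variance)  # step 5: square root of the variance

round(c(mean = x_bar, variance = variance, sd = std_dev), 1)
```

R's built-in `var()` and `sd()` divide by the number of observations minus one (the sample formulas), so on a small list like this they return larger values than the hand calculation.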
I expect the average [whatever you’re studying] to differ by [your standard deviation] from the mean.
Example: Mean: 5; SD: 2.6
I expect the average [number of household family members] to differ by [2.6 people] from the mean [of 5 people per household].
Knowledge Check
What is the best measure of central tendency for the racial composition of students at UofT?
- Mode
- Median
- Mean
What is the best measure of central tendency for the distance traveled to work each day by Toronto residents?
- Mode
- Median
- Mean
What is the best measure of central tendency for the age at marriage?
- Mode
- Median
- Mean
What is the likely distribution of age at first childbirth?
- Symmetrical
- Positively Skewed
- Negatively Skewed
What is the median of the table on the next slide?
- almost daily
- once or twice a week
- several times a month
- about once a month
Knowledge Check Table
How often do you spend a social evening with relatives?
gss_all$socrel <- as_factor(zap_missing(gss_all$socrel))
# Create frequency & proportions table
tab <- gss_all %>%
filter(year >= 2022, !is.na(socrel)) %>%
group_by(socrel) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(
percent = round(100 * n / sum(n), 0),
cum_percent = round(cumsum(percent), 0)
) %>%
ungroup()
# Add totals row
tab_totals <- tab %>%
summarise(across(c(n, percent), ~sum(.x, na.rm = TRUE))) %>%
mutate(socrel = "Total")
# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)
## Pretty table
tab_with_totals %>%
flextable() %>%
style_flextable() %>%
set_header_labels(
n = "n", percent = "%",
cum_percent = "cumulative %")

| socrel | n | % | cumulative % |
|---|---|---|---|
| almost daily | 527 | 12 | 12 |
| once or twice a week | 922 | 20 | 32 |
| several times a month | 845 | 19 | 51 |
| about once a month | 688 | 15 | 66 |
| several times a year | 910 | 20 | 86 |
| about once a year | 368 | 8 | 94 |
| never | 279 | 6 | 100 |
| Total | 4,539 | 100 | |
Source: U.S. General Social Survey 2022-2024
Knowledge Check
Top Home Runs in 2022
\(62, 46, 40, 40, 38\)
What’s the standard deviation?
N = 5
MEAN = 45.2
VARIANCE = 77.76
SD = 8.82
Knowledge Check
Women’s Basketball Points Leaders 2022
\(29.1, 27.4, 26.9, 25.2, 23.2\)
What’s the standard deviation?
N = 5
MEAN = 26.36
VARIANCE = 4.04
SD = 2.01
Is there more variability in top home run leaders or top point earners in women’s basketball?
- Home runs, because the standard deviation is smaller.
- Women’s points, because the standard deviation is smaller.
- Home runs, because the standard deviation is larger.
- We can’t compare between sports.
Hint: Compare the SD from the previous two slides!
Support
Overview of the measures of central tendency
Difference between the mean, median, and mode
Overview of the measures of variability
Calculate measures of variability